Abstract
In this tutorial, we will go through the Seurat integration across modalities vignette and demonstrate how the Canonical Correlation Analysis (CCA) can be used for integrating scRNAseq and scATACseq data from ~10K PBMC cells sequenced with 10X technology.
Seurat is a tool developed by the lab of Rahul Satija to facilitate analysis of Single Cell Omics (scOmics) data. Started as a pipeline tool, i.e. a collection of all known pre-processing and scOmics analysis steps, Seurat developed a few original methods on normalization and batch-effect correction. With the rise of Human Cell Atlas (HCA) consortium, Seurat contributed to ideas on data harmonization across multiple labs and technologies. The ambition of HCA was to develop an Atlas of all human cells from all human tissues, that could serve as a reference for the future research, i.e. human cells from a particular experiment could have been quickly assigned to a particular cell type without performing the regular single cell analysis steps.
Seurat was for a very long time de-facto a standard tool for single cell data analysis in 2016-2018, especially in North America, before the first two Seurat articles got published in 2018 and 2019. Nowadays, there are alternative single cell analysis workflows, such as SCRAN (lab of John Marioni at EBI) and SCANPY (lab of Fabian Thejs in Munich), that can compete with Seurat. In the first paper of A. Butler et al., Integrating single cell transcriptomic data across different conditions, technilogies and species in Nature Biotechnology, 2018, Seurat suggested an interesting modification of the Canonical Correlation Analysis (CCA), that belongs to the same family of algorithms as PLS, OPLS, JIVE and DISCO. The modification was to include an alignment of canonical correlation vectors (PLS-components) with the Dynamic Time Warping (DTW), which typically serves as a trajectory similarity measure in Time Series data analysis.
The idea of data integration by Seurat is that CCA delivers components representing linear combinations of features across data sets that are maximally correlated (capture correlation structures across data sets), but not necessarily aligned. Next, Dynamic Time Warping (DTW) is used to locally compress or stretch the vectors during alignment to correct for changes in population density. As a result, the data sets are represented in a single, integrated low-dimensional space.
The CCA + DTW strategy was successfully used for data harmonization across multiple conditions, species, labs and technologies. However, those examples were a sort of single cell oriented batch-effect correction that has been known as a big problem in data analysis for years. In other words, despite CCA + DTW offer an impressive integrative framework, possibly adjusted for single cell data analysis, this methodology is for integration across samples and not straghtforward to extend to the integration across Omics, which is the main challenge and focus of multi-Omics data integration.
Later, Seurat extended the CCA approach for integrating across Omics in the work Stuart et al. 2019, where they used an interesting “anchor” idea borrowed from the Mutual Nearest Neighbors (MNN) algorithm of Haghverdi et al. 2018. The idea is based on identifying cells across two Omics that are most similar to each other (anchors), i.e. most likely belong to the same cell population, and align all the rest cells accordingly. This was probably the first true integration across Omics for single cell area.
Below, we will demonstrate how to use the “anchors” approach for integrating scRNAseq (9432 cells) and scATACseq (8728 cells) data from PBMC cells. Please note that the two modalities were not measures on the physically same cells but on cells from the same tissue. This was because there was no high-throughput technology available at that time that could produce multiple modalities from the same biological cells. Nowadays, with the advent of 10X Multiome technology, the “anchor” method was replaced by the Weighted Nearest Neighbors (WNN) method.
First, we load in the provided matrix of ATACseq peaks and collapse the peak matrix to a “gene activity matrix”. Here, we make the simplifying assumption that a gene’s activity can be quantified by simply summing all counts within the gene body + 2kb upstream. Next we build the Seurat object and store the original peaks as “ATAC” assay. As a QC step, we also filter out all cells here with fewer than 10K total counts in the scATAC-seq data.
library("Seurat")
## Loading required package: SeuratObject
## Loading required package: sp
##
## Attaching package: 'SeuratObject'
## The following object is masked from 'package:base':
##
## intersect
library("ggplot2")
library("Signac")
####################################################################################
#Gene activity quantification (ATACseq peaks)
peaks <- Read10X_h5("data/atac_v1_pbmc_10k_filtered_peak_bc_matrix.h5")
dim(peaks)
## [1] 89796 8728
peaks[1:5,1:5]
## 5 x 5 sparse Matrix of class "dgCMatrix"
## AAACGAAAGAGCGAAA-1 AAACGAAAGAGTTTGA-1 AAACGAAAGCGAGCTA-1
## chr1:565107-565550 . . .
## chr1:569174-569639 . . .
## chr1:713460-714823 . . 2
## chr1:752422-753038 . . .
## chr1:762106-763359 . . 4
## AAACGAAAGGCTTCGC-1 AAACGAAAGTGCTGAG-1
## chr1:565107-565550 . .
## chr1:569174-569639 . .
## chr1:713460-714823 8 .
## chr1:752422-753038 . .
## chr1:762106-763359 2 .
#Here we summed up all peaks overlapping a gene and computed of ATACseq peaks per gene (here we just load an already pre-computed matrix for the sake of time)
activity.matrix <- read.delim("data/scatacseq_activity_matrix.txt", header = TRUE, row.names = 1, sep="\t")
colnames(activity.matrix)<-gsub("\\.","-",colnames(activity.matrix))
dim(activity.matrix)
## [1] 18969 8728
activity.matrix[1:5,1:5]
## AAACGAAAGAGCGAAA-1 AAACGAAAGAGTTTGA-1 AAACGAAAGCGAGCTA-1
## RP5-857K21.4 0 0 0
## RP11-206L10.2 0 0 2
## RP11-206L10.10 0 0 0
## LINC01128 0 0 4
## FAM41C 0 0 0
## AAACGAAAGGCTTCGC-1 AAACGAAAGTGCTGAG-1
## RP5-857K21.4 0 0
## RP11-206L10.2 8 0
## RP11-206L10.10 0 0
## LINC01128 3 0
## FAM41C 4 0
#Seurat object setup
pbmc.atac <- CreateSeuratObject(counts = peaks, assay = "ATAC", project = "10x_ATAC")
pbmc.atac@assays$ATAC@layers$counts[1:5,1:5]
## 5 x 5 sparse Matrix of class "dgCMatrix"
##
## [1,] . . . . .
## [2,] . . . . .
## [3,] . . 2 8 .
## [4,] . . . . .
## [5,] . . 4 2 .
pbmc.atac[["ACTIVITY"]] <- CreateAssayObject(counts = activity.matrix)
pbmc.atac
## An object of class Seurat
## 108765 features across 8728 samples within 2 assays
## Active assay: ATAC (89796 features, 0 variable features)
## 1 layer present: counts
## 1 other assay present: ACTIVITY
meta<-read.table("data/atac_v1_pbmc_10k_singlecell.csv",sep=",",header=TRUE,row.names=1,stringsAsFactors=FALSE)
head(meta)
## total duplicate chimeric unmapped lowmapq mitochondrial
## NO_BARCODE 8335327 3922818 71559 746476 634014 22972
## AAACGAAAGAAACGCC-1 3 1 0 1 0 0
## AAACGAAAGAAAGCAG-1 14 1 0 4 2 0
## AAACGAAAGAAAGGGT-1 7 1 0 1 0 0
## AAACGAAAGAAATACC-1 9880 5380 79 106 1120 6
## AAACGAAAGAAATCTG-1 2 0 0 0 0 0
## passed_filters cell_id is__cell_barcode TSS_fragments
## NO_BARCODE 2937488 None 0 0
## AAACGAAAGAAACGCC-1 1 None 0 1
## AAACGAAAGAAAGCAG-1 7 None 0 0
## AAACGAAAGAAAGGGT-1 5 None 0 2
## AAACGAAAGAAATACC-1 3189 None 0 386
## AAACGAAAGAAATCTG-1 2 None 0 1
## DNase_sensitive_region_fragments enhancer_region_fragments
## NO_BARCODE 0 0
## AAACGAAAGAAACGCC-1 1 1
## AAACGAAAGAAAGCAG-1 2 0
## AAACGAAAGAAAGGGT-1 2 0
## AAACGAAAGAAATACC-1 1623 297
## AAACGAAAGAAATCTG-1 2 1
## promoter_region_fragments on_target_fragments
## NO_BARCODE 0 0
## AAACGAAAGAAACGCC-1 0 1
## AAACGAAAGAAAGCAG-1 0 2
## AAACGAAAGAAAGGGT-1 1 2
## AAACGAAAGAAATACC-1 107 1758
## AAACGAAAGAAATCTG-1 1 2
## blacklist_region_fragments peak_region_fragments
## NO_BARCODE 0 0
## AAACGAAAGAAACGCC-1 0 1
## AAACGAAAGAAAGCAG-1 0 0
## AAACGAAAGAAAGGGT-1 0 1
## AAACGAAAGAAATACC-1 16 262
## AAACGAAAGAAATCTG-1 0 2
meta <- meta[colnames(pbmc.atac), ]
pbmc.atac <- AddMetaData(pbmc.atac, metadata = meta)
pbmc.atac <- subset(pbmc.atac, subset = nCount_ATAC > 10000)
pbmc.atac$tech <- "atac"
rm(activity.matrix, peaks, meta)
gc()
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 6915709 369.4 13183080 704.1 13183080 704.1
## Vcells 175058137 1335.6 483114493 3685.9 579092784 4418.2
Next we filter the individual data sets and visualize the individual scRNAseq and scATACseq data sets. Here we perform Latent Semantic Indexing (LSI)) to reduce the dimensionality of the scATAC-seq data down to 30 dimensions. This procedure learns an ‘internal’ structure for the scRNA-seq data, and is important when determining the appropriate weights for the anchors when transferring information. We utilize Latent Semantic Indexing (LSI) to learn the structure of ATAC-seq data, as proposed in Cusanovich et al, Science 2015. LSI is implemented here by performing computing the term frequency-inverse document frequency (TF-IDF) followed by SVD. We use all peaks that have at least 1000 reads across all cells. We also include a pre-processed scRNAseq PBMC cells data set that was used in many other Seurat vignettes as a benchmark data set. We exclude the first dimension as this is typically correlated with sequencing depth.
#Preprocessing scATACseq per gene (activity martix), we will use it later in order to find anchors between cells in the scATAC-seq dataset and the scRNA-seq dataset.
DefaultAssay(pbmc.atac) <- "ACTIVITY"
pbmc.atac <- FindVariableFeatures(pbmc.atac)
pbmc.atac <- NormalizeData(pbmc.atac)
pbmc.atac <- ScaleData(pbmc.atac)
## Centering and scaling data matrix
#Preprocessing raw peaks
DefaultAssay(pbmc.atac) <- "ATAC"
VariableFeatures(pbmc.atac) <- names(which(Matrix::rowSums(pbmc.atac) > 1000))
pbmc.atac <- RunTFIDF(pbmc.atac)
## Performing TF-IDF normalization
pbmc.atac <- FindTopFeatures(pbmc.atac, min.cutoff = "q0")
pbmc.atac <- RunSVD(pbmc.atac)
## Running SVD
## Scaling cell embeddings
pbmc.atac <- RunTSNE(pbmc.atac, reduction = "lsi", dims = 2:30)
#Reading scRNAseq data set
pbmc.rna <- readRDS("data/pbmc_10k_v3.rds")
pbmc.rna = UpdateSeuratObject(pbmc.rna)
## Validating object structure
## Updating object slots
## Ensuring keys are in the proper structure
## Updating matrix keys for DimReduc 'pca'
## Updating matrix keys for DimReduc 'tsne'
## Updating matrix keys for DimReduc 'umap'
## Ensuring keys are in the proper structure
## Ensuring feature names don't have underscores or pipes
## Updating slots in RNA
## Updating slots in RNA_nn
## Setting default assay of RNA_nn to RNA
## Updating slots in RNA_snn
## Setting default assay of RNA_snn to RNA
## Updating slots in pca
## Updating slots in tsne
## Setting tsne DimReduc to global
## Updating slots in umap
## Setting umap DimReduc to global
## Setting assay used for NormalizeData.RNA to RNA
## Setting assay used for FindVariableFeatures.RNA to RNA
## Setting assay used for ScaleData.RNA to RNA
## Setting assay used for RunPCA.RNA to RNA
## Setting assay used for RunTSNE.pca to RNA
## Setting assay used for FindNeighbors.RNA.pca to RNA
## No assay information could be found for FindClusters
## Setting assay used for RunUMAP.RNA.pca to RNA
## Validating object structure for Assay 'RNA'
## Validating object structure for Graph 'RNA_nn'
## Validating object structure for Graph 'RNA_snn'
## Validating object structure for DimReduc 'pca'
## Validating object structure for DimReduc 'tsne'
## Validating object structure for DimReduc 'umap'
## Object representation is consistent with the most current Seurat version
dim(pbmc.rna@assays$RNA@counts)
## [1] 19089 9432
pbmc.rna@assays$RNA@counts[1:5,1:5]
## 5 x 5 sparse Matrix of class "dgCMatrix"
## rna_AAACCCAAGCGCCCAT-1 rna_AAACCCACAGAGTTGG-1 rna_AAACCCACAGGTATGG-1
## AL627309.1 . . .
## AL669831.5 . . .
## FAM87B . . .
## LINC00115 . . .
## FAM41C . . .
## rna_AAACCCACATAGTCAC-1 rna_AAACCCACATCCAATG-1
## AL627309.1 . .
## AL669831.5 . .
## FAM87B . .
## LINC00115 . .
## FAM41C . .
pbmc.rna$tech <- "rna"
#Plotting scATACseq and scRNAseq next to each other
p1 <- DimPlot(pbmc.atac, reduction = "tsne") + NoLegend() + ggtitle("scATAC-seq") #here we do not have cell annotation, we will predict it using scRNAseq
p2 <- DimPlot(pbmc.rna, reduction = "tsne", group.by = "celltype", label = TRUE, repel = TRUE) + NoLegend() + ggtitle("scRNA-seq")
p1 + p2
Now, we can identify anchors between the scATAC-seq dataset and the scRNA-seq dataset and use these anchors to transfer the celltype labels we learned from the 10K scRNA-seq data to the scATAC-seq cells, i.e. we will apply the transferring anchors Seurat algorithm and visualize the individual scRNAseq and scATACseq data sets after they have been harmonized.
#Transfer anchors
gc()
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 7117070 380.1 13183080 704.1 13183080 704.1
## Vcells 357474934 2727.4 579817391 4423.7 579092784 4418.2
transfer.anchors <- FindTransferAnchors(reference = pbmc.rna, query = pbmc.atac, features = VariableFeatures(object = pbmc.rna),
reference.assay = "RNA", query.assay = "ACTIVITY", reduction = "cca")
## Running CCA
## Merging objects
## Finding neighborhoods
## Finding anchors
## Found 16387 anchors
celltype.predictions <- TransferData(anchorset = transfer.anchors, refdata = pbmc.rna$celltype, weight.reduction = pbmc.atac[["lsi"]], dims = 2:30)
## Finding integration vectors
## Finding integration vector weights
## Predicting cell labels
pbmc.atac <- AddMetaData(pbmc.atac, metadata = celltype.predictions)
#We can then examine the distribution of prediction scores
#and optionally filter out those cells with low scores.
#Here, we find that over 95% of the cells receive a score of 0.5 or greater.
hist(pbmc.atac$prediction.score.max)
abline(v = 0.5, col = "red")
table(pbmc.atac$prediction.score.max > 0.5)
##
## FALSE TRUE
## 371 5509
#Visualizing individual scRNAseq and scATACseq after their alignment
pbmc.atac.filtered <- subset(pbmc.atac, subset = prediction.score.max > 0.5)
pbmc.atac.filtered$predicted.id <- factor(pbmc.atac.filtered$predicted.id, levels = levels(pbmc.rna)) # to make the colors match (although it does not really work, seems to be broken in Seurat package)
p1 <- DimPlot(pbmc.atac.filtered, reduction = "tsne", group.by = "predicted.id", label = TRUE, repel = TRUE) + ggtitle("scATAC-seq cells") + NoLegend() + scale_colour_hue(drop = FALSE)
p2 <- DimPlot(pbmc.rna, reduction = "tsne", group.by = "celltype", label = TRUE, repel = TRUE) + ggtitle("scRNA-seq cells") + NoLegend()
p1 + p2
Finally we will perform co-embedding and tSNE visualization of the scRNAseq and scATACseq Omics in their common space after the integration has been done. They demonstrate very encouraging overlapping. Here, we use the same anchors used earlier to transfer cell type labels to impute RNA-seq values for the scATAC-seq cells. We then merge the measured and imputed scRNA-seq data and run a standard tSNE analysis to visualize all the cells together. In order to perform co-embedding, we first ‘impute’ RNA expression into the scATAC-seq cells based on the previously computed anchors, and then merge the datasets.
#Co-embedding
# note that we restrict the imputation to variable genes from scRNA-seq, but could impute the full transcriptome if we wanted to
gc()
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 7213684 385.3 13183080 704.1 13183080 704.1
## Vcells 587524916 4482.5 1002215650 7646.3 695222615 5304.2
genes.use <- VariableFeatures(pbmc.rna)
refdata <- GetAssayData(pbmc.rna, assay = "RNA", slot = "data")[genes.use, ]
## Warning: The `slot` argument of `GetAssayData()` is deprecated as of SeuratObject 5.0.0.
## ℹ Please use the `layer` argument instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
# refdata (input) contains a scRNA-seq expression matrix for the scRNA-seq cells. imputation (output) will contain an imputed scRNA-seq matrix for each of the ATAC cells
imputation <- TransferData(anchorset = transfer.anchors, refdata = refdata, weight.reduction = pbmc.atac[["lsi"]], dims = 2:30)
## Finding integration vectors
## Finding integration vector weights
## Transfering 3000 features onto reference data
# this line adds the imputed data matrix to the pbmc.atac object
pbmc.atac[["RNA"]] <- imputation
coembed <- merge(x = pbmc.rna, y = pbmc.atac)
rm(pbmc.rna, pbmc.atac)
# Finally, we run PCA and UMAP on this combined object, to visualize the co-embedding of both datasets
coembed <- ScaleData(coembed, features = genes.use, do.scale = FALSE)
## Centering data matrix
gc()
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 7217386 385.5 13183080 704.1 13183080 704.1
## Vcells 724185068 5525.1 1202738780 9176.2 991672700 7565.9
coembed <- RunPCA(coembed, features = genes.use, verbose = FALSE)
gc()
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 7219005 385.6 13183080 704.1 13183080 704.1
## Vcells 725122806 5532.3 1202738780 9176.2 991672700 7565.9
coembed <- RunTSNE(coembed, dims = 1:30)
coembed$celltype <- ifelse(!is.na(coembed$celltype), coembed$celltype, coembed$predicted.id)
p1 <- DimPlot(coembed, group.by = "tech", reduction = "tsne") + ggtitle("scRNAseq + scATACseq")
p2 <- DimPlot(coembed, group.by="celltype", label=TRUE, repel=TRUE, reduction="tsne") + NoLegend() + ggtitle("Annotated consensus")
p1 + p2
We conclude that scATACseq and scRNAseq demonstrate a very nice overlapping implying that the function of the cell clusters can be explain from both transcriptomic and epigenetic point of view, i.e. those two layers of information are complementary and one can be used instead of the other one to enhance the signal in case of experiemntal technical failure for any group of cells.
rm(list = ls())
Weighted Nearest Neighbor (WNN) approach for single cell data integration across multiple modalities represents a recent development on the Seurat workflow published in Hao et al., Cell in 2021. The WNN method is based on constructing KNN graphs in individual modalities and intersecting the graphs to get a consensus across modalities graph and a UMAP low dimensional representation of the graph.
Here we are going to utilize WNN for the CITEseq technology on PBMC cells that we have previously used to demonstrate principles of unsupervised OMICs integration through UMAP and Autoencoder. We use the CITE-seq dataset from (Stuart and Butler et al., Cell 2019), which consists of 30672 scRNAseq profiles measured alongside a panel of 25 antibodies (we refer to it as ADT or scProteomics) from bone marrow. The workflow consists of three steps:
Here we will start with loading the CITEseq data into memory:
library("Seurat")
library("SeuratData")
library("cowplot")
library("dplyr")
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
#Load the CITEseq data data
gc()
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 7222280 385.8 13183080 704.1 13183080 704.1
## Vcells 377666700 2881.4 1202738780 9176.2 991672700 7565.9
InstallData("bmcite")
## Installing package into '/home/jovyan/R/x86_64-conda-linux-gnu-library/4.3'
## (as 'lib' is unspecified)
bm <- LoadData(ds = "bmcite")
The gene expression and protein abundance matrices can be accesed as follows:
bm@assays$RNA$counts[1:5,1:5]
## 5 x 5 sparse Matrix of class "dgCMatrix"
## a_AAACCTGAGCTTATCG-1 a_AAACCTGAGGTGGGTT-1 a_AAACCTGAGTACATGA-1
## FO538757.2 . . .
## AP006222.2 . . .
## RP4-669L17.10 . . .
## RP11-206L10.9 . . .
## LINC00115 . . .
## a_AAACCTGCAAACCTAC-1 a_AAACCTGCAAGGTGTG-1
## FO538757.2 . .
## AP006222.2 . .
## RP4-669L17.10 . .
## RP11-206L10.9 . .
## LINC00115 . .
dim(bm@assays$RNA)
## [1] 17009 30672
bm@assays$ADT@counts[1:5,1:5]
## 5 x 5 sparse Matrix of class "dgCMatrix"
## a_AAACCTGAGCTTATCG-1 a_AAACCTGAGGTGGGTT-1 a_AAACCTGAGTACATGA-1
## CD11a 110 541 120
## CD11c 51 12 10
## CD123 14 5 1
## CD127-IL7Ra 20 187 43
## CD14 15 7 4
## a_AAACCTGCAAACCTAC-1 a_AAACCTGCAAGGTGTG-1
## CD11a 645 2469
## CD11c 42 884
## CD123 5 36
## CD127-IL7Ra 115 18
## CD14 10 438
dim(bm@assays$ADT)
## [1] 25 30672
Next we will perform data pre-processing, normalization and dimension reduction on each modality independently. For scRNAseq we will use library size normalization followed by log-transform, while total sum scaling (TSS) normalization across cells followed by the CLR normalization is recommended by Seurat for ADT data (as they can be viewed as compositional data after TSS procedure):
gc()
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 7273756 388.5 13183080 704.1 13183080 704.1
## Vcells 473261237 3610.7 1202738780 9176.2 991672700 7565.9
DefaultAssay(bm) <- 'RNA'
bm <- NormalizeData(bm) %>% FindVariableFeatures() %>% ScaleData() %>% RunPCA()
## Centering and scaling data matrix
## PC_ 1
## Positive: TRBC1, LAT, CD8B, CCL5, KLRB1, IGKC, S100A12, GZMA, S100A8, S100A9
## MS4A1, S100B, GNLY, CST7, TYROBP, KLRD1, RP11-291B21.2, NKG7, VCAN, CD14
## IGLC2, CCL4, AC092580.4, FCN1, IGLC3, PRF1, RBP7, SERPINA1, DUSP2, JUN
## Negative: KIAA0101, TYMS, KLF1, KCNH2, FAM178B, APOC1, CNRIP1, CENPU, GATA1, BIRC5
## CENPF, EPCAM, CKS2, RP11-620J15.3, TUBA1B, TFR2, CA1, HMGA1, STMN1, HIST1H4C
## CDT1, AHSP, TOP2A, TK1, GFI1B, TUBB, MKI67, NME4, SMIM1, TMEM56
## PC_ 2
## Positive: RPL3, RPS3, RPS18, RPS5, RPS4X, RPSA, RPS12, RPS23, RPS2, EEF1B2
## RPL4, LDHB, NPM1, RPS17, RPLP0, TRBC1, LAT, RPL7A, GYPC, HSPA8
## CD8B, KLRB1, CCL5, HNRNPA1, PEBP1, RPL37A, MYC, NUCB2, SOD1, CD79A
## Negative: LYZ, FCN1, CST3, TYROBP, S100A9, LST1, S100A8, CSTA, MNDA, VCAN
## LGALS1, AIF1, S100A12, CFD, SERPINA1, FCER1G, MS4A6A, FOS, S100A6, CD14
## LGALS2, FTH1, GAPDH, ANXA2, CD36, CPVL, RBP7, HLA-DRA, LINC01272, H3F3A
## PC_ 3
## Positive: CD74, HLA-DRA, HLA-DPB1, HLA-DPA1, IGLL1, ITM2C, SMIM24, RPS23, FABP5, HLA-DRB5
## TCF4, RPS18, PLD4, HLA-DQA1, RPS4X, SOX4, SPINK2, RPLP1, KIAA0125, HLA-DQB1
## HLA-DRB1, RPS2, PRSS57, IRF8, CDCA7, C1QTNF4, STMN1, BCL11A, NREP, RPS24
## Negative: GYPA, ALAS2, SLC4A1, AHSP, HBM, GYPB, CA2, HBD, RHAG, CA1
## HBA1, TMEM56, SELENBP1, HBB, HBA2, MYL4, HEMGN, HMBS, RHCE, HBQ1
## DMTN, RFESD, SPTA1, FECH, KLF1, ANK1, CTSE, SMIM1, TSPO2, SLC25A37
## PC_ 4
## Positive: CD79A, MS4A1, CD79B, CD74, HLA-DRA, HLA-DPB1, HLA-DPA1, IGHD, HLA-DQA1, HLA-DQB1
## BANK1, VPREB3, SPIB, TCL1A, HLA-DRB5, FAM129C, LINC00926, IGHM, IGKC, HLA-DRB1
## TNFRSF13C, JCHAIN, TSPAN13, IRF8, FCER2, CD24, BLK, CD22, GNG7, FCRLA
## Negative: NKG7, CST7, PRSS57, GNLY, GZMA, KLRB1, CCL5, LDHB, KLRD1, CMC1
## CYTL1, PRF1, EGFL7, LYAR, TRBC1, KLRF1, GAPDH, S100A6, NGFRAP1, LAT
## RPS3, S100B, FGFBP2, GZMH, GATA2, RPS24, NPW, CCL4, SMIM24, CDK6
## PC_ 5
## Positive: RPS2, RPS18, RPS12, RPS23, RPL37A, RPLP1, RPL4, RPS5, RPS4X, CNRIP1
## RPS17, APOC1, NPM1, MYC, EEF1B2, FAM178B, EPCAM, KCNH2, KLF1, RPLP0
## RPL7A, CYTL1, MS4A1, LDHB, SMIM10, GATA1, APOE, RPS3, TFR2, RPL3
## Negative: NKG7, GZMB, CST7, GNLY, GZMA, KLRD1, CMC1, KLRF1, PRF1, CCL4
## CCL5, GZMH, CLIC3, FGFBP2, SPON2, C12orf75, TRDC, KLRB1, HOPX, XCL2
## CD160, IL2RB, PLAC8, FCGR3A, TRGC2, KLRG1, DUSP2, RHOC, PLEK, LAIR2
gc()
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 7274104 388.5 13183080 704.1 13183080 704.1
## Vcells 536305892 4091.7 1202738780 9176.2 1053999630 8041.4
DefaultAssay(bm) <- 'ADT'
# we will use all ADT features for dimensional reduction, we set a dimensional reduction name to avoid overwriting
VariableFeatures(bm) <- rownames(bm[["ADT"]])
bm <- NormalizeData(bm, normalization.method = 'CLR', margin = 2) %>% ScaleData() %>% RunPCA(reduction.name = 'apca')
## Normalizing across cells
## Centering and scaling data matrix
## PC_ 1
## Positive: CD3, CD28, CD27, CD127-IL7Ra, CD278-ICOS, CD4, CD8a, CD161, CD25, CD45RO
## CD45RA, CD197-CCR7
## Negative: HLA.DR, CD11c, CD14, CD38, CD69, CD123, CD11a, CD34, CD19, CD79b
## CD16, CD56
## PC_ 2
## Positive: CD45RO, CD11a, CD14, CD4, CD11c, CD28, CD3, CD69, CD278-ICOS, CD127-IL7Ra
## CD38, CD161
## Negative: CD45RA, CD19, CD79b, CD197-CCR7, CD8a, CD57, CD56, CD16, CD27, CD25
## HLA.DR, CD34
## PC_ 3
## Positive: CD16, CD56, CD8a, CD11a, CD57, CD161, CD45RA, CD38, CD3, CD127-IL7Ra
## CD27, CD11c
## Negative: CD19, CD79b, CD4, CD25, HLA.DR, CD197-CCR7, CD69, CD34, CD28, CD45RO
## CD14, CD278-ICOS
## PC_ 4
## Positive: CD161, CD25, CD45RO, CD56, CD16, CD4, CD79b, CD19, CD28, CD57
## CD127-IL7Ra, HLA.DR
## Negative: CD8a, CD27, CD14, CD45RA, CD3, CD11c, CD69, CD38, CD197-CCR7, CD11a
## CD34, CD123
## PC_ 5
## Positive: CD4, CD38, CD123, CD16, CD34, CD45RA, CD56, CD278-ICOS, CD28, CD197-CCR7
## CD27, CD3
## Negative: CD45RO, CD8a, CD69, CD161, CD79b, CD19, CD25, CD57, CD14, CD11a
## CD127-IL7Ra, HLA.DR
Now we are going to apply the Mutual Nearest Neighbour (MNN) algorithm for intersecting the two constructed graphs for the individual Omics. For each cell, we calculate its closest neighbors in the dataset based on a weighted combination of RNA and protein similarities. We specify the dimensionality of each modality, which is similar to specifying the number of PCs to include in scRNA-seq clustering.
# Identify multimodal neighbors. These will be stored in the neighbors slot, and can be accessed using bm[['weighted.nn']]
# The WNN graph can be accessed at bm[["wknn"]], and the SNN graph used for clustering at bm[["wsnn"]]
# Cell-specific modality weights can be accessed at bm$RNA.weight
#bm.weight <- FindModalityWeights(object = bm, reduction.list = list("pca", "apca"),dims.list = list(1:30, 1:18))
#bm.weight@first.modality.weight
gc()
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 7277759 388.7 13183080 704.1 13183080 704.1
## Vcells 537879846 4103.7 1202738780 9176.2 1053999630 8041.4
bm <- FindMultiModalNeighbors(bm, reduction.list = list("pca", "apca"), dims.list = list(1:30, 1:18), modality.weight.name = "RNA.weight")
## Calculating cell-specific modality weights
## Finding 20 nearest neighbors for each modality.
## Calculating kernel bandwidths
## Finding multimodal neighbors
## Constructing multimodal KNN graph
## Constructing multimodal SNN graph
Now we will run UMAP on the two independent modalities as well as on the WNN graph constructed after overlapping both Omics, i.e. on the data based on a weighted combination of RNA and protein data. We can also perform graph-based clustering and visualize these results on the UMAP, alongside a set of cell annotations.
gc()
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 7286297 389.2 13183080 704.1 13183080 704.1
## Vcells 537836762 4103.4 1202738780 9176.2 1053999630 8041.4
#Individual Omics
bm <- RunUMAP(bm, reduction = 'pca', dims = 1:30, assay = 'RNA', reduction.name = 'rna.umap', reduction.key = 'rnaUMAP_')
## Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
## To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
## This message will be shown once per session
## 09:53:56 UMAP embedding parameters a = 0.9922 b = 1.112
## Found more than one class "dist" in cache; using the first, from namespace 'spam'
## Also defined by 'BiocGenerics'
## 09:53:56 Read 30672 rows and found 30 numeric columns
## 09:53:56 Using Annoy for neighbor search, n_neighbors = 30
## Found more than one class "dist" in cache; using the first, from namespace 'spam'
## Also defined by 'BiocGenerics'
## 09:53:56 Building Annoy index with metric = cosine, n_trees = 50
## 0% 10 20 30 40 50 60 70 80 90 100%
## [----|----|----|----|----|----|----|----|----|----|
## **************************************************|
## 09:53:59 Writing NN index file to temp file /tmp/Rtmpko5Jt0/filee25307f1713
## 09:53:59 Searching Annoy index using 1 thread, search_k = 3000
## 09:54:10 Annoy recall = 100%
## 09:54:11 Commencing smooth kNN distance calibration using 1 thread with target n_neighbors = 30
## 09:54:12 Initializing from normalized Laplacian + noise (using RSpectra)
## 09:54:14 Commencing optimization for 200 epochs, with 1427826 positive edges
## 09:54:33 Optimization finished
bm <- RunUMAP(bm, reduction = 'apca', dims = 1:18, assay = 'ADT', reduction.name = 'adt.umap', reduction.key = 'adtUMAP_')
## 09:54:33 UMAP embedding parameters a = 0.9922 b = 1.112
## Found more than one class "dist" in cache; using the first, from namespace 'spam'
## Also defined by 'BiocGenerics'
## 09:54:33 Read 30672 rows and found 18 numeric columns
## 09:54:33 Using Annoy for neighbor search, n_neighbors = 30
## Found more than one class "dist" in cache; using the first, from namespace 'spam'
## Also defined by 'BiocGenerics'
## 09:54:33 Building Annoy index with metric = cosine, n_trees = 50
## 0% 10 20 30 40 50 60 70 80 90 100%
## [----|----|----|----|----|----|----|----|----|----|
## **************************************************|
## 09:54:36 Writing NN index file to temp file /tmp/Rtmpko5Jt0/filee2528a1087
## 09:54:36 Searching Annoy index using 1 thread, search_k = 3000
## 09:54:47 Annoy recall = 100%
## 09:54:48 Commencing smooth kNN distance calibration using 1 thread with target n_neighbors = 30
## 09:54:51 Initializing from normalized Laplacian + noise (using RSpectra)
## 09:54:52 Commencing optimization for 200 epochs, with 1341598 positive edges
## 09:55:11 Optimization finished
#Consensus graph
bm <- RunUMAP(bm, nn.name = "weighted.nn", reduction.name = "wnn.umap", reduction.key = "wnnUMAP_")
## 09:55:11 UMAP embedding parameters a = 0.9922 b = 1.112
## Found more than one class "dist" in cache; using the first, from namespace 'spam'
## Also defined by 'BiocGenerics'
## 09:55:12 Commencing smooth kNN distance calibration using 1 thread with target n_neighbors = 20
## 09:55:14 Initializing from normalized Laplacian + noise (using RSpectra)
## 09:55:15 Commencing optimization for 200 epochs, with 983534 positive edges
## 09:55:35 Optimization finished
bm <- FindClusters(bm, graph.name = "wsnn", algorithm = 3, resolution = 2, verbose = FALSE)
Finally, we will visualize both Omics and the consunsus UMAP plot:
p1 <- DimPlot(bm, reduction='rna.umap', group.by='celltype.l2', label=TRUE, repel=TRUE, label.size=2.5) + NoLegend() + ggtitle("scRNAseq")
p2 <- DimPlot(bm, reduction='adt.umap', group.by='celltype.l2', label=TRUE, repel=TRUE, label.size=2.5) + NoLegend() + ggtitle("scProteomics")
p1 + p2
p3 <- DimPlot(bm, reduction="wnn.umap", group.by="celltype.l2", label=TRUE, label.size=2.5, repel=TRUE) + NoLegend() + ggtitle("WNN")
p3
We can see that the CD8 Naive T-cells that were overlapping with the CD4 naive T-cells using scRNAseq data alone, can now be clearly distinguishable due to the overlapping with the ADT scProteomics data.
sessionInfo()
## R version 4.3.1 (2023-06-16)
## Platform: x86_64-conda-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.6 LTS
##
## Matrix products: default
## BLAS/LAPACK: /usr/local/miniconda3/lib/libopenblasp-r0.3.27.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] bmcite.SeuratData_0.3.0 dplyr_1.1.3 cowplot_1.1.1
## [4] SeuratData_0.2.1 Signac_1.12.0 ggplot2_3.4.4
## [7] Seurat_5.0.1 SeuratObject_5.0.1 sp_2.1-1
##
## loaded via a namespace (and not attached):
## [1] RColorBrewer_1.1-3 jsonlite_1.8.7 magrittr_2.0.3
## [4] spatstat.utils_3.0-3 farver_2.1.1 rmarkdown_2.25
## [7] zlibbioc_1.48.0 vctrs_0.6.4 ROCR_1.0-11
## [10] Rsamtools_2.18.0 spatstat.explore_3.2-3 RCurl_1.98-1.12
## [13] RcppRoll_0.3.0 htmltools_0.5.6.1 sass_0.4.7
## [16] sctransform_0.4.1 parallelly_1.36.0 KernSmooth_2.23-22
## [19] bslib_0.5.1 htmlwidgets_1.6.2 ica_1.0-3
## [22] plyr_1.8.9 plotly_4.10.2 zoo_1.8-12
## [25] cachem_1.0.8 igraph_1.5.1 mime_0.12
## [28] lifecycle_1.0.3 pkgconfig_2.0.3 Matrix_1.6-3
## [31] R6_2.5.1 fastmap_1.1.1 GenomeInfoDbData_1.2.11
## [34] fitdistrplus_1.1-11 future_1.33.0 shiny_1.7.5.1
## [37] digest_0.6.33 colorspace_2.1-0 patchwork_1.1.3
## [40] S4Vectors_0.40.2 tensor_1.5 RSpectra_0.16-1
## [43] irlba_2.3.5.1 GenomicRanges_1.54.1 labeling_0.4.3
## [46] progressr_0.14.0 fansi_1.0.5 spatstat.sparse_3.0-2
## [49] httr_1.4.7 polyclip_1.10-6 abind_1.4-5
## [52] compiler_4.3.1 bit64_4.0.5 withr_2.5.1
## [55] BiocParallel_1.36.0 fastDummies_1.7.3 MASS_7.3-60
## [58] rappdirs_0.3.3 tools_4.3.1 lmtest_0.9-40
## [61] httpuv_1.6.11 future.apply_1.11.0 goftest_1.2-3
## [64] glue_1.6.2 nlme_3.1-163 promises_1.2.1
## [67] grid_4.3.1 Rtsne_0.16 cluster_2.1.4
## [70] reshape2_1.4.4 generics_0.1.3 hdf5r_1.3.8
## [73] gtable_0.3.4 spatstat.data_3.0-1 tidyr_1.3.0
## [76] data.table_1.14.8 XVector_0.42.0 utf8_1.2.4
## [79] BiocGenerics_0.48.1 spatstat.geom_3.2-5 RcppAnnoy_0.0.21
## [82] ggrepel_0.9.4 RANN_2.6.1 pillar_1.9.0
## [85] stringr_1.5.0 spam_2.9-1 RcppHNSW_0.5.0
## [88] later_1.3.1 splines_4.3.1 lattice_0.22-5
## [91] bit_4.0.5 survival_3.5-7 deldir_1.0-9
## [94] tidyselect_1.2.0 Biostrings_2.70.1 miniUI_0.1.1.1
## [97] pbapply_1.7-2 knitr_1.44 gridExtra_2.3
## [100] IRanges_2.36.0 scattermore_1.2 stats4_4.3.1
## [103] xfun_0.40 matrixStats_1.0.0 stringi_1.7.12
## [106] lazyeval_0.2.2 yaml_2.3.7 evaluate_0.22
## [109] codetools_0.2-19 tibble_3.2.1 cli_3.6.1
## [112] uwot_0.1.16 xtable_1.8-4 reticulate_1.39.0
## [115] munsell_0.5.0 jquerylib_0.1.4 Rcpp_1.0.11
## [118] GenomeInfoDb_1.38.1 globals_0.16.2 spatstat.random_3.1-6
## [121] png_0.1-8 parallel_4.3.1 ellipsis_0.3.2
## [124] dotCall64_1.1-0 bitops_1.0-7 listenv_0.9.0
## [127] viridisLite_0.4.2 scales_1.2.1 ggridges_0.5.4
## [130] crayon_1.5.2 leiden_0.4.3 purrr_1.0.2
## [133] rlang_1.1.1 fastmatch_1.1-4